Identifying and measuring related queries

ABSTRACT

A system and method are disclosed for identifying similar queries. A user query may be compared with known search keywords. The user query may be a Chinese related query, which is converted into a different form before comparing with other converted queries or keywords. A similarity score based on different features may be used for comparing the queries.

This application is a continuation-in-part of U.S. patent application Ser. No. 11/363,315 (U.S. Pat. Pub. No. 2007/0203894), entitled “SYSTEM AND METHOD FOR IDENTIFYING RELATED QUERIES FOR LANGUAGES WITH MULTIPLE WRITING SYSTEMS,” filed Feb. 28, 2006, the disclosure of which is hereby incorporated by reference.

BACKGROUND

Online advertising may be an important source of revenue for enterprises engaged in electronic commerce. A number of different kinds of web page based online advertisements are currently in use, along with various associated distribution requirements, advertising metrics, and pricing mechanisms. Processes associated with technologies such as Hypertext Markup Language (HTML) and Hypertext Transfer Protocol (HTTP) enable a web page to be configured to contain a location for inclusion of an advertisement. A page may not only be a web page, but any other electronically created page or document. An advertisement can be selected for display each time the page is requested, for example, by a browser or server application.

Online advertising may be linked to online searching. Online searching is a common way for consumers to locate information, goods, or services on the Internet. A consumer may use an online search engine to type in a query to search for other pages or web sites with information related to that query. When the advertising that is shown on the search engine page is related to the query, the search may be referred to as a sponsored search. Sponsored searching may require advertisers to bid for search keywords. The search keywords are associated with the search query for displaying advertisements with the search results. It may be difficult to identify which keyword(s) a search query is related to. In particular, users may enter search queries that are misspelled or that are in a different language.

BRIEF DESCRIPTION OF THE DRAWINGS

The system and method may be better understood with reference to the following drawings and description. Non-limiting and non-exhaustive embodiments are described with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. In the drawings, like referenced numerals designate corresponding parts throughout the different views.

FIG. 1 is a block diagram of an exemplary network system;

FIG. 2 is a block diagram of a language analyzer;

FIG. 3 is a block diagram of exemplary conversion forms;

FIG. 4 is a block diagram of exemplary comparisons of queries;

FIG. 5 is a flow diagram for identifying related queries; and

FIG. 6 is a block diagram of a general computer system for use with the disclosed embodiments.

DETAILED DESCRIPTION

By way of introduction, the embodiments described below include a system and method for identifying and measuring related queries. The embodiments relate to identifying similar Chinese queries. A user query may be compared with known search keywords or other search queries. The search keywords may be used by advertisers for sponsored searching. The user query may be a non-native language query, such as a Chinese related query in an English language website or a query in a Chinese website. The user query is converted into a different form before comparing with other converted queries or the search keywords. For explanation purposes, the embodiments are described in terms of a Chinese related query, but other languages or query platforms may be used. A similarity score based on various features may be used for comparing the queries. Based on the similarity score or other comparison features, the original user query may be substituted by other queries or be associated with one or more search keywords. The associated search keywords may be used for selecting the advertisements that are displayed with the search results for that search query.

Alternatively, related queries may be identified from a reformulation of the original query. The reformulation may be based on stored query logs and used to compare the original query with stored queries. As part of the comparison, various features, including language specific features, may be used to measure query similarity. Based on the query similarity the original query may be substituted for a stored query or search keyword for identifying the relevant advertisements to display. A user's query may be misspelled and the system may identify a related query that is correctly spelled that replaces the initial user query. Chinese related queries may be identified and measured due to an increased interest in Chinese search and advertising markets.

Other systems, methods, features and advantages will be, or will become, apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the following claims. Nothing in this section should be taken as a limitation on those claims. Further aspects and advantages are discussed below in conjunction with the embodiments.

FIG. 1 provides a simplified view of a network system 100 in which the present embodiments may be implemented. Not all of the depicted components may be required, however, and some embodiments of the invention may include additional components not shown in the figure. Variations in the arrangement and type of the components may be made without departing from the spirit or scope of the claims as set forth herein. Additional, different or fewer components may be provided.

FIG. 1 is a block diagram illustrating an embodiment of an exemplary network system 100 for language analysis and comparison. In particular, system 100 includes a language analyzer 104 that may receive and convert a user's search query for comparison with other queries or search keywords. A user device 106 is coupled with a search engine 102 through the network 109. The search engine 102 is coupled with a search log database 112, and both are coupled with the language analyzer 104. The search log database 112 is coupled with a data source 113 and a unit dictionary 116. An ad server 103 may be coupled with the search engine 102 and/or coupled with the language analyzer 104. Herein, the phrase “coupled with” is defined to mean directly connected to or indirectly connected through one or more intermediate components. Such intermediate components may include both hardware and software based components. Variations in the arrangement and type of the components may be made without departing from the spirit or scope of the claims as set forth herein.

The user device 106 may be a computing device for a user to connect to a network 109, such as the Internet. Examples of a user device include but are not limited to a personal computer, personal digital assistant (“PDA”), cellular phone, or other electronic device. The user device 106 may be configured to access other data/information in addition to web pages over the network 109 with a web browser, such as INTERNET EXPLORER® (sold by Microsoft Corp., Redmond, Wash.). The user device 106 may enable a user to view pages over the network 109, such as the Internet. The user device 106 may be the user device described below with respect to FIG. 6.

The user device 106 may be configured to allow a user to interact with the search engine 102, the ad server 103, or other components of the system 100. In one embodiment, the user device 106 may receive and display a site or page provided by the search engine 102. The user device 106 may include a keyboard, keypad or a cursor control device, such as a mouse, or a joystick, touch screen display, remote control or any other device operative to allow a user to interact with the page(s) provided by the search engine 102 and/or the ad server 103.

The search engine 102 is coupled with the user device 106 through the network 109, as well as being coupled with the language analyzer 104, the ad server 103 and/or the search log database 112. In one embodiment, the search engine 102 is a web server. The search engine 102 may provide a site or a page over a network, such as the network 109 or the Internet. A site or page may refer to a web page or a series of related web pages which may be received or viewed over a network. The site or page is not limited to a web page, and may include any information accessible over a network that may be displayed at the user device 106. In one embodiment, a site may refer to a series of pages which are linked by a site map. For example, the web site of www.yahoo.com (operated by Yahoo! Inc., in Sunnyvale, Calif.) may include thousands of pages, which are included at yahoo.com. Hereinafter, a page will be described as a web page, a web site, or any other site/page accessible over a network. A user of the user device 106 may access a page provided by the search engine 102 over the network 109. As described below, the page provided by the search engine 102 may be a search page that receives a search query from the user device 106 and provides search results that are based on the received search query.

The search engine 102 may include an interface, such as a web page, e.g., the web page which may be accessed on the World Wide Web at yahoo.com, which is used to search for pages which are accessible via the network 109. The user device 106, autonomously or at the direction of the user, may input a search query (also referred to as a user query, original query, search term or a search keyword) for the search engine 102. A single search query may include multiple words or phrases. The search engine 102 may perform a search for the search query and display the results of the search on the user device 106. The results of a search may include a listing of related pages or sites that is provided by the search engine 102 in response to receiving the search query.

The ad server 103 is coupled with the search engine 102 and/or the language analyzer 104. The ad server 103 may be configured to provide advertisements to the search engine 102. In an alternate embodiment, the search engine 102 and the ad server 103 may be a common component and/or the search engine 102 may select and provide advertisements. The ad server 103 may include or be coupled with an advertisement database that includes advertisements that are available to be displayed by the search engine 102 for sponsored searching. In addition, the advertisements may be associated with one or more search keywords. The search keywords may be purchased or bid on by advertisers. Accordingly, when that search keyword is searched for, the advertiser who purchased or placed the highest bid is selected and their advertisement is displayed. The ad server 103 may include or be coupled with a database, such as an advertisement database, that stores search keywords and the respective price or bid for each keyword from advertisers that is referenced for each search query. In one embodiment, a search query is received and compared with known search keywords or other search queries when the ad server 103 selects and provides the advertisement to the search engine 102.
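The highest-bid selection described above can be sketched as follows. This is only an illustration: the `Bid` record, the `select_advertisement` function, and the sample data are hypothetical names introduced here and are not part of the disclosed system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bid:
    """Hypothetical record tying an advertiser's bid to a search keyword."""
    advertiser: str
    keyword: str
    amount: float
    advertisement: str


def select_advertisement(keyword: str, bids: List[Bid]) -> Optional[str]:
    """Return the advertisement of the highest bidder for the keyword, if any."""
    matching = [bid for bid in bids if bid.keyword == keyword]
    if not matching:
        return None
    return max(matching, key=lambda bid: bid.amount).advertisement


# Illustrative data only.
SAMPLE_BIDS = [
    Bid("advertiser_a", "art museums", 0.40, "Ad A"),
    Bid("advertiser_b", "art museums", 0.55, "Ad B"),
    Bid("advertiser_c", "law enforcement", 0.30, "Ad C"),
]
print(select_advertisement("art museums", SAMPLE_BIDS))  # Ad B
```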

The search log database 112 includes records or logs of at least a subset of the search queries entered in the search engine 102 over a period of time and may also be referred to as a search query log, search term database, keyword database or query database. In one embodiment, the search log database 112 may store the search keywords that are used by the ad server 103 in selecting an advertisement for a particular search query. The search log database 112 may include search queries from any number of users over any period of time. Alternatively, the search log database 112 may include records or logs of a subset of the queries or requests for data entered at the search engine 102 over a period of time. The search log database 112 may also store associations between search queries from the search engine 102. For example, a search query may be associated with a search keyword or other search queries after a conversion and comparison by the language analyzer 104 as discussed below.

The search log database 112 may also be coupled with a data source 113. The data source 113 may be an internal source of data, an external source of search data, or a combination of the two. An external data source may include search results from other search engines or other sources. For example, a search engine other than search engine 102 may be an external data source and provide search logs to the search log database 112. An internal data source may include search data or other data from the search engine 102. Other data may include other searching or web browsing tendencies identified by the search engine 102.

The search log database 112 may also be coupled with a unit dictionary 116. The unit dictionary 116 may be a database of user queries or search keywords that are coupled with one another as units. Units may also be referred to as concepts or topics and are sequences of one or more words that appear in search queries. For example, the search query “New York City law enforcement” may include two units, e.g. “New York City” may be one unit and “law enforcement” may be another unit. A unit is a phrase of common words that identify a single concept. As another example, the search query “Chicago art museums” may include two units, e.g. “Chicago” and “art museums.” The “Chicago” unit is a single word, and “art museums” is a two-word unit. Units identify common groups of keywords to maximize the efficiency and relevance of search results. The unit dictionary 116 may include Chinese related queries, as well as Chinese related units that include Chinese characters. Categorization of search queries into units is discussed in commonly owned U.S. Pat. No. 7,051,023 issued May 23, 2006, entitled “SYSTEMS AND METHODS FOR GENERATING CONCEPT UNITS FROM SEARCH QUERIES,” which is hereby incorporated by reference.

The unit dictionary 116 and the categorization of search queries into units may be used to compare and analyze search queries received by the search engine 102. A search query may be broken into units that are compared with units from other queries or search keywords. In one embodiment, past search queries and search keywords are stored in the search log database 112 as units that may be used in an analysis by the language analyzer 104.
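As a rough illustration of breaking a query into units, the sketch below uses a greedy longest-match lookup against a small dictionary. The greedy strategy, the `segment_into_units` name, the `max_words` limit, and the dictionary contents are assumptions made for this example; the incorporated reference may generate units differently.

```python
def segment_into_units(query, unit_dictionary, max_words=4):
    """Split a query into units by greedy longest match (illustrative only)."""
    words = query.lower().split()
    units = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first; fall back to a single word.
        for size in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + size])
            if size == 1 or candidate in unit_dictionary:
                units.append(candidate)
                i += size
                break
    return units


# Hypothetical unit dictionary contents.
UNIT_DICTIONARY = {"new york city", "law enforcement", "art museums", "chicago"}

print(segment_into_units("New York City law enforcement", UNIT_DICTIONARY))
# ['new york city', 'law enforcement']
print(segment_into_units("Chicago art museums", UNIT_DICTIONARY))
# ['chicago', 'art museums']
```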

In one embodiment, the ad server 103, the search engine 102 and/or the search log database 112 may be coupled with the language analyzer 104. The language analyzer 104 receives a user query from the user device 106 and matches or identifies other queries or search keywords. The user query may be converted to a different form for comparing various features of the user query with search keywords as discussed with respect to FIG. 2.

The language analyzer 104 may be a computing device as described below with respect to FIG. 6. In one embodiment, the language analyzer 104 includes a processor 105, memory 107, software 108 and an interface 110. The language analyzer 104 may be a separate component from the search engine 102 and the ad server 103. In an alternative embodiment, any of the language analyzer 104, search engine 102, and the ad server 103 may be combined as a single component. The interface 110 may communicate with any of the search engine 102, search log database 112, and ad server 103. In one embodiment, the interface 110 may include a user interface configured to allow a user to interact with any of the components of the language analyzer 104. For example, a user may be able to modify the conversion form or comparison features that are used by the language analyzer 104.

The processor 105 in the language analyzer 104 may include a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP) or other type of processing device. The processor 105 may be a component in any one of a variety of systems. For example, the processor 105 may be part of a standard personal computer or a workstation. The processor 105 may be one or more general processors, digital signal processors, application specific integrated circuits, field programmable gate arrays, servers, networks, digital circuits, analog circuits, combinations thereof, or other now known or later developed devices for analyzing and processing data. The processor 105 may operate in conjunction with a software program, such as code generated manually (i.e., programmed).

The processor 105 may be coupled with a memory 107, or the memory 107 may be a separate component. The interface 110 and/or the software 108 may be stored in the memory 107. The memory 107 may include, but is not limited to, computer readable storage media such as various types of volatile and non-volatile storage media, including but not limited to random access memory, read-only memory, programmable read-only memory, electrically programmable read-only memory, electrically erasable read-only memory, flash memory, magnetic tape or disk, optical media and the like. In one embodiment, the memory 107 includes a random access memory for the processor 105. In alternative embodiments, the memory 107 is separate from the processor 105, such as a cache memory of a processor, the system memory, or other memory. The memory 107 may be an external storage device or database for storing recorded image data. Examples include a hard drive, compact disc (“CD”), digital video disc (“DVD”), memory card, memory stick, floppy disc, universal serial bus (“USB”) memory device, or any other device operative to store image data. The memory 107 is operable to store instructions executable by the processor 105.

The functions, acts or tasks illustrated in the figures or described herein may be performed by the programmed processor executing the instructions stored in the memory 107. The functions, acts or tasks are independent of the particular type of instruction set, storage media, processor or processing strategy and may be performed by software, hardware, integrated circuits, firmware, micro-code and the like, operating alone or in combination. Likewise, processing strategies may include multiprocessing, multitasking, parallel processing and the like. The processor 105 is configured to execute the software 108. The software 108 may include instructions for analyzing and converting search queries and comparing features with other queries or search keywords.

The interface 110 may be a user input device or a display. The interface 110 may include a keyboard, keypad or a cursor control device, such as a mouse, or a joystick, touch screen display, remote control or any other device operative to interact with the language analyzer 104. The interface 110 may include a display coupled with the processor 105 and configured to display an output from the processor 105. The display may be a liquid crystal display (LCD), an organic light emitting diode (OLED), a flat panel display, a solid state display, a cathode ray tube (CRT), a projector, a printer or other now known or later developed display device for outputting determined information. The display may act as an interface for the user to see the functioning of the processor 105, or as an interface with the software 108 for providing input parameters. In particular, the interface 110 may allow a user to interact with the language analyzer 104 to establish a conversion of a user query and the features that are compared in matching a query with a search keyword.

Any of the components in system 100 may be coupled with one another through a network. For example, the language analyzer 104 may be coupled with the search engine 102, search log database 112, or ad server 103 via a network. Any of the components in system 100 may include communication ports configured to connect with a network. The present disclosure contemplates a computer-readable medium that includes instructions or receives and executes instructions responsive to a propagated signal, so that a device connected to a network can communicate voice, video, audio, images or any other data over a network. The instructions may be transmitted or received over the network via a communication port or may be a separate component. The communication port may be created in software or may be a physical connection in hardware. The communication port may be configured to connect with a network, external media, display, or any other components in system 100, or combinations thereof. The connection with the network may be a physical connection, such as a wired Ethernet connection or may be established wirelessly as discussed below. Likewise, the connections with other components of the system 100 may be physical connections or may be established wirelessly.

The network or networks that may connect any of the components in the system 100 to enable communication of data between the devices may include wired networks, wireless networks, or combinations thereof. The wireless network may be a cellular telephone network, a network operating according to a standardized protocol such as IEEE 802.11, 802.16, 802.20, published by the Institute of Electrical and Electronics Engineers, Inc., or a WiMax network. Further, the network(s) may be a public network, such as the Internet, a private network, such as an intranet, or combinations thereof, and may utilize a variety of networking protocols now available or later developed including, but not limited to TCP/IP based networking protocols. The network(s) may include one or more of a local area network (LAN), a wide area network (WAN), a direct connection such as through a Universal Serial Bus (USB) port, and the like, and may include the set of interconnected networks that make up the Internet. The network(s) may include any communication method or employ any form of machine-readable media for communicating information from one device to another. For example, the ad server 103 or the search engine 102 may provide pages to the user device 106 over a network, such as the network 109. The network or networks described above, including the network 109, may be the network discussed below with respect to FIG. 6.

The ad server 103, the search engine 102, the search log database 112, the language analyzer 104, the unit dictionary 116 and/or the user device 106 may represent computing devices of various kinds, such as the components described with respect to FIG. 6. Such computing devices may generally include any device that is configured to perform computation and that is capable of sending and receiving data communications by way of one or more wired and/or wireless communication interfaces. Such devices may be configured to communicate in accordance with any of a variety of network protocols, as discussed above. For example, the user device 106 may be configured to execute a browser application that employs HTTP to request information, such as a web page, from the search engine 102 or ad server 103. The present disclosure contemplates a computer-readable medium that includes instructions or receives and executes instructions responsive to a propagated signal, so that any device connected to a network can communicate voice, video, audio, images or any other data over a network.

FIG. 2 illustrates an embodiment of a language analyzer. As described with respect to FIG. 1, the language analyzer 104 may convert a search query into a different form for comparing its features with other queries or search keywords that are used for selecting matching advertisements to be displayed on a search results page. The language analyzer 104 may include a receiver 202, a converter 204, a comparator 206, and a calculator 208. As shown, the language analyzer 104 or any of its components may represent computing devices of various kinds, such as the components described with respect to FIG. 6.

The receiver 202 may receive a user query from the search engine 102, which may receive the user query from the user device 106. The receiver 202 may also receive search keywords from the ad server 103. The search keywords may be matched with advertisements, such that when a user inputs the search keyword in a search engine, the search results page includes the matched advertisement. Accordingly, the language analyzer 104 may match user queries with search keywords for selecting advertisements to be displayed on the search query results page.

The converter 204 is coupled with the receiver 202. The converter 204 receives the user query or other search keywords and converts them into a different form for comparison. As described, the user query may be a Chinese related query and the converter 204 may convert the Chinese related query into a different form to aid comparison. A Chinese related query may include any Chinese characters, including Roman characters that represent a Chinese character or phrase. Chinese related queries may also include queries that originate from or are received by a Chinese search engine and may be simplified Chinese and/or traditional Chinese.

FIG. 3 illustrates exemplary conversion forms. In particular, the converter 204 may utilize any of the conversion forms 302 to convert a Chinese related query. The converter 204 may convert a search query into any of the conversion forms 302 to compare the query with other converted queries or converted search terms. As described below, the conversion may include a transformation of the query by adding, deleting, and/or substituting characters or words in the queries. The conversion or transformation may result in a common format or common form that may be used for comparing the queries. The conversion forms 302 shown in FIG. 3 are merely exemplary. In alternate embodiments, there may be additional conversion forms 302 that are not illustrated or described. The conversion may receive a Chinese related query and convert each element or selected elements of the query into an array that represents the converted form of the Chinese related query.

A first conversion form is a conversion into Chinese soundex 304. The Chinese characters are converted into pinyin without tone, while the roman letters remain. The query is then converted into a Chinese soundex-like representation by first retaining the first letter of a string. Second, all occurrences of a, e, h, i, o, and u are removed, except where one of them is the first letter. Third, characters may be replaced, such as replacing “zh” with “z,” “ch” with “c,” “sh” with “s,” “ng” with “n,” “rd” with “d,” “rl” with “l,” “rn” with “n,” “rs” with “s,” and/or “rt” with “t.” Fourth, the remaining letters after the first letter are assigned a number, such as (m, n, l)=1, (b, p)=2, (f, v, w, h)=3, (d, t)=4, (j, z, s, x, q, c, g, k)=5, (r)=6, (y)=7, and (a)=8. Fifth, if two or more letters are adjacent, then the first letter remains and the others are omitted. Sixth, the spaces are removed. Seventh, all characters remaining are returned.
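The seven steps can be sketched roughly as below. Several points are assumptions of this sketch rather than statements of the disclosure: the input is taken to be already converted to pinyin without tone, the whole query is treated as one string (so only its very first letter is retained), and the fifth step is read in the usual soundex sense of collapsing adjacent identical codes. The name `chinese_soundex` is hypothetical.

```python
import re

# Step three: digraph replacements, applied in the order listed in the text.
REPLACEMENTS = [("zh", "z"), ("ch", "c"), ("sh", "s"), ("ng", "n"), ("rd", "d"),
                ("rl", "l"), ("rn", "n"), ("rs", "s"), ("rt", "t")]

# Step four: letter groups and the digit assigned to each group.
CODES = {}
for letters, digit in [("mnl", "1"), ("bp", "2"), ("fvwh", "3"), ("dt", "4"),
                       ("jzsxqcgk", "5"), ("r", "6"), ("y", "7"), ("a", "8")]:
    for letter in letters:
        CODES[letter] = digit


def chinese_soundex(pinyin_query):
    """Soundex-like code for a query already written in toneless pinyin."""
    text = pinyin_query.strip().lower()
    if not text:
        return ""
    # Steps one and two: keep the first letter, drop a, e, h, i, o, u elsewhere.
    text = text[0] + re.sub(r"[aehiou]", "", text[1:])
    # Step three: replace the listed letter pairs.
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    # Step four: encode every letter after the first; other symbols pass through.
    head, tail = text[0], text[1:]
    coded = [CODES.get(ch, ch) for ch in tail]
    # Step five (as interpreted here): collapse runs of identical symbols.
    collapsed = []
    for symbol in coded:
        if not collapsed or collapsed[-1] != symbol:
            collapsed.append(symbol)
    # Steps six and seven: remove spaces and return what remains.
    return head + "".join(symbol for symbol in collapsed if symbol != " ")


print(chinese_soundex("zhong guo"))  # z15
print(chinese_soundex("zong guo"))   # z15, same code as the spelling above
```

Because the two spellings map to the same code, a query typed without the "h" would still match in this form.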

A second conversion form is converting the Chinese characters into the keyboard input form zhuyin (Bopomofo) 306. Each element in the array is either all zhuyin characters for one corresponding Chinese character or a roman character originally in the query without transformation. A third conversion form is a similar zhuyin (Bopomofo) conversion 308, except each element in the array is either one zhuyin character or a roman character originally in the query without transformation.

A fourth conversion form is converting Chinese characters into radicals 310. Each element in the array is either the radical for a Chinese character or the roman character originally in the query without transformation. A radical 310 may be the semantic root (i.e., portion bearing the meaning) of a Chinese character. A radical may be part of a Chinese character and/or the semantic component of this Chinese character. For example, in the character 姐, pronounced as jie with a meaning of “sister”, the left part 女 (pronounced nǚ in Mandarin Chinese) is the semantic component. Chinese characters may have at least one or two radicals. The radicals may be used for Chinese Hanzi. A dictionary may be used to match a Chinese character with its radical(s). When a Chinese character has multiple radicals, the most meaningful radical (which may be identified in a dictionary) may be considered for comparison.
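A dictionary-based radical conversion might look like the following sketch. The three-entry `RADICALS` table and the `to_radicals` name are placeholders introduced for illustration; a real system would rely on a complete character-to-radical dictionary.

```python
# Hypothetical, tiny character-to-radical table (one radical per character).
RADICALS = {"姐": "女", "妈": "女", "河": "氵"}


def to_radicals(query):
    """Return one element per character: its radical if known, else the character."""
    return [RADICALS.get(ch, ch) for ch in query]


print(to_radicals("姐a"))  # ['女', 'a']
```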

A fifth conversion form is converting Chinese characters into pinyin without tone 312. Each element in the array is either the complete pinyin without tone for one corresponding Chinese character or a roman character originally in the query without any transformation. A sixth conversion form is converting Chinese characters into pinyin without tone 314 in which each element in the array is either one pinyin character or a roman character originally in the query without transformation. Pinyin may be a Standard Mandarin Romanization system. In pinyin, the pin refers to a “spelling” and the yin refers to a “sound.” There may be a pinyin corresponding to each Chinese character. One pinyin may include more than two roman characters. In the fifth conversion form, each pinyin may be a unit for similarity comparison. In the sixth conversion form, each character within pinyin may be a unit for comparison.

A seventh conversion form is converting Chinese characters into pinyin with tone 316. Each element in the array is either the complete pinyin and its tone for one corresponding Chinese character or a roman character originally in the query without transformation. An eighth conversion form is converting Chinese characters into pinyin with tone 318 in which each element in the array is either one pinyin character, its tone, or a roman character originally in the query without transformation. A ninth conversion form is converting queries into two character-based arrays 320. In particular, if a character is Chinese, the three bytes of its UTF-8 encoding form one element in the array. In other words, each Chinese character is represented in three bytes. If a character is roman, then the roman character itself is an element.
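The ninth form can be illustrated with Python's built-in UTF-8 encoder. In this sketch, treating every non-ASCII character as Chinese is a simplifying assumption, and `to_char_array` is a hypothetical name.

```python
def to_char_array(query):
    """Ninth form: UTF-8 bytes for each non-ASCII character, the character itself otherwise."""
    return [ch if ch.isascii() else ch.encode("utf-8") for ch in query]


elements = to_char_array("中a")
print(elements)          # [b'\xe4\xb8\xad', 'a']
print(len(elements[0]))  # 3
```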

A tenth conversion form is the removal of Chinese characters 322. The roman characters are left in the query and the Chinese characters are removed. Likewise, an eleventh conversion form removes the roman characters 324, and keeps the Chinese characters in the query. A twelfth conversion form includes leaving the query as inputted 326. In other words, the twelfth conversion is no conversion 326.

In one embodiment, the receiver 202 receives two queries that are to be compared to determine the similarity between those queries. The queries are converted into at least one of the conversion forms by the converter 204. In one embodiment, both queries are converted into the twelve exemplary conversion forms 302 and the queries are compared in all twelve converted forms. Alternatively, certain conversion forms are selected for converting the queries and the queries are compared for each of those converted forms.

After being converted, the queries may be compared by the comparator 206. The comparator 206 may be configured to perform comparison of a user's search query with other queries or with search keywords that are used by the ad server 103 for displaying relevant advertisements that are linked to particular search keywords. In one embodiment, the comparator 206 determines the similarity between two queries. The queries are first converted into a similar form or similar forms by the converter 204 and each of those forms is compared by the comparator 206. In one embodiment, the queries are converted into the twelve forms illustrated in FIG. 3 and the comparator 206 makes twelve comparisons between the queries, one for each of the twelve conversions of each query. In alternative embodiments, there may be more or fewer conversion forms that are compared by the comparator 206.

In one embodiment, a user query may be compared with a candidate set of queries to determine which of the candidate set is most similar to the user query. The candidate set may be made up of search keywords which are compared with the user query to determine which search keyword is most similar. The candidate set of queries or keywords for comparison may be chosen based on an initial analysis of the user query compared with the search log database 112. In one embodiment, when the user query is received the candidate set is identified and each member of the candidate set is compared with the user query to determine which is most similar. As described below, a similarity score may be calculated for each member of the candidate set that represents a similarity with the user query. The member of the candidate set with the closest similarity score may be most similar to the user query. In an alternative embodiment, the candidate set may include one query or include all queries, such as those stored in the search log database 112.

FIG. 4 illustrates exemplary comparisons of queries. In particular, the comparator 206 may utilize comparison features 402 when comparing queries. The comparison features 402 shown in FIG. 4 are merely exemplary. In alternate embodiments, there may be additional comparison features 402 that are not illustrated or described. The comparison may involve comparing various forms of converted Chinese related queries. In particular, the comparator 206 may compare an array of elements that is generated by the converter 204 as a converted form of a Chinese related query. In one embodiment, the comparator 206 may compare queries as described in the commonly owned U.S. application entitled, “SYSTEM AND METHOD FOR IDENTIFYING RELATED QUERIES FOR LANGUAGES WITH MULTIPLE WRITING SYSTEMS,” U.S. Pat. Pub. No. 2007/0203894, filed Feb. 28, 2006, the disclosure of which is hereby incorporated by reference.

A first comparison feature may be an edit distance 404 between two queries. The edit distance may be a measure of the difference between two character strings, such as queries. In one embodiment, the edit distance may be a minimum number of edit operations required to transform the first query into the second query. The edit operation may include inserting or deleting a character into a string or replacing a character by another character. In an alternative embodiment, weights may be assigned for different edit operations. For example, a higher weight may be placed on replacing the character s by the character p, than on replacing it by the character a. The edit distance may be the Levenshtein distance or the Damerau-Levenshtein distance when a transposition of characters counts as a single edit operation. In alternative embodiments, there may be other algorithms that are used for determining the edit distance between queries or there may be more or fewer edit operations that are used in determining an edit distance between queries.
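As one possible illustration of the edit distance described above, the following sketch computes the Levenshtein distance and, optionally, counts a transposition of adjacent elements as a single edit. The transposition variant shown is the restricted form commonly called optimal string alignment; it is offered as an assumption about how the Damerau-Levenshtein option might be realized, and the function name is hypothetical.

```python
def edit_distance(a, b, transpositions=False):
    """Minimum number of edits turning sequence a into sequence b.

    Edits are insertion, deletion, and replacement, each with unit weight.
    With transpositions=True, swapping two adjacent elements also counts
    as one edit (optimal string alignment variant).
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete the first i elements of a
    for j in range(cols):
        d[0][j] = j  # insert the first j elements of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement or match
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


print(edit_distance("kitten", "sitting"))
print(edit_distance("ab", "ba"), edit_distance("ab", "ba", transpositions=True))
```

Because the function only compares elements for equality, it may be applied to strings or to arrays of elements such as those produced by the conversion forms. Per-operation weights, as mentioned above, could be introduced by replacing the unit costs.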

A second comparison feature may be an edit distance without a domain 406. In particular, two queries may have their domains removed before computing the edit distance. The domain may be a web domain, such as “.com” that is removed. The removal of the domain may be helpful because a user querying “yahoo.com” and a user querying “yahoo.net” are likely making the same query. A third comparison feature may be a character level prefix overlap 408. The character level prefix overlap 408 may be a measure of the characters/words that are the same at the beginning of the queries. For example, “auto cleaners” and “auto cleaning” have a prefix overlap of “auto clean.” The prefix overlap may indicate increased similarity. A fourth comparison feature may be a character level suffix overlap 410. The character level suffix overlap 410 measures the similarity between queries at the end of the query. For example, “auto insurance agent” and “home insurance agent” share a suffix overlap of “insurance agent.” Similar to the prefix overlap, the suffix overlap may indicate increased similarity.
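The domain removal and the character level prefix and suffix overlaps may be sketched as shown below. The list of domain suffixes in the regular expression is a hypothetical example, and the function names are assumptions made for illustration only.

```python
import re

# Assumption: a "domain" is a trailing label such as ".com"; this short list
# of suffixes is hypothetical and would be extended in practice.
DOMAIN_SUFFIX = re.compile(r"\.(com|net|org)$", re.IGNORECASE)


def strip_domain(query):
    """Remove a trailing web domain suffix from a query, if present."""
    return DOMAIN_SUFFIX.sub("", query.strip())


def prefix_overlap(q1, q2):
    """Longest run of characters shared at the beginning of both queries."""
    n = 0
    for x, y in zip(q1, q2):
        if x != y:
            break
        n += 1
    return q1[:n]


def suffix_overlap(q1, q2):
    """Longest run of characters shared at the end of both queries."""
    return prefix_overlap(q1[::-1], q2[::-1])[::-1]


print(strip_domain("yahoo.com") == strip_domain("yahoo.net"))
print(repr(prefix_overlap("auto cleaners", "auto cleaning")))
print(repr(suffix_overlap("auto insurance agent", "home insurance agent")))
```

At the character level the shared suffix in the last example includes the space preceding “insurance”; a caller may strip surrounding whitespace if only the words are of interest.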

A fifth comparison feature may be a minimum edit distance 412 over all the conversion forms. Likewise, a sixth comparison feature may be a maximum edit distance 414 over all the conversion forms. Given twelve conversion forms and twelve edit distances for each conversion, the minimum edit distance 412 and the maximum edit distance 414 may be identified. In one embodiment, the minimum and maximum may be removed as outliers. Alternatively, the minimum or maximum may be weighted higher when computing a similarity score. A seventh comparison feature may be a minimum edit distance without a domain 416 and an eighth comparison feature may be a maximum edit distance without a domain 418. As discussed above, the domains in a query may not be valuable in terms of determining what the user is searching for, so the domains are removed before comparison.

Additional comparison features may be a word level edit distance 420, a word level prefix overlap 422, or a word level suffix overlap 424. The word level comparisons are similar to the character level comparisons, except entire words are compared rather than individual characters. A length difference 426 between two queries may also be used for comparing.

The comparator 206 may be coupled with a calculator 208 that may calculate a similarity score. The similarity score may be a measure of the similarity between the queries. The similarity score may be calculated based on individual comparisons of different conversion forms of two queries with each individual comparison being assigned a weighted value. The multiple conversion forms described with respect to FIG. 3 may each result in a separate comparison between two queries. Accordingly, using the twelve conversion forms 302, there may be twelve different edit distances or similarity scores, one for each conversion. Those twelve converted forms may be compared and multiplied by a weight for each form to get an overall similarity score between the queries. Alternatively, a subset of the twelve conversion forms or additional conversion forms not described may be utilized to convert Chinese related queries into different forms for comparison.

In one embodiment, the equation presented in Table A may be used to calculate a similarity score indicating the strength of similarity between a query pair. The query pair may include a given query q and a comparison query MODS(q), either of which may be written according to one or more Chinese writing systems. MODS(q) may represent a converted query. In alternative embodiments, both q and MODS(q) may be converted to the same form for comparison, or MODS(q) is converted into a form for comparison with q. MODS(q) may represent a related query that is identified as a potential substitute for the user query q. When MODS(q) has a good similarity score with the user's original query q, MODS(q) may be used as a search keyword for fetching advertisements. MODS(q) may also be referred to as a rewritten query. The equation in Table A makes use of a subset of the conversion forms 302 and the comparison features 402 that are discussed above. In alternative embodiments, different conversion forms or comparison features may be utilized to generate a similarity score. Those of skill in the art recognize that the equation illustrated in Table A is merely exemplary and may be modified so as to provide for the calculation of a similarity score for multiple writing systems. A formula may be optimized based on the source of the query, because queries from Taiwan may be different from queries from Hong Kong. Accordingly, the conversions, comparisons, and weights may be modified for different types of queries.

TABLE A

$\begin{aligned}
\text{LM1Score}(q, \text{MODS}(q)) = {} & 2.542 \\
& - 0.1778 \times \text{pq12min}(q, \text{MODS}(q)) \\
& - 0.3316 \times \text{levroman}(q, \text{MODS}(q)) \\
& - 1.064 \times \text{agreechar}(q, \text{MODS}(q)) \\
& + 1.098 \times \text{dlevpynchar}(q, \text{MODS}(q)) \\
& - 0.2432 \times \text{q1bidded}(q, \text{MODS}(q)) \\
& + 0.3486 \times \text{wordr}(q, \text{MODS}(q)) \\
& + 0.2487 \times \text{q2hasroman}(q, \text{MODS}(q)) \\
& - 0.1284 \times \text{pq12max}(q, \text{MODS}(q)) \\
& - 0.4667 \times \text{levtaiwanchar}(q, \text{MODS}(q)) \\
& - 0.2875 \times \text{lengthdiffn}(q, \text{MODS}(q)) \\
& - 0.0006 \times \text{entropy21min}(q, \text{MODS}(q)) \\
& - 0.2875 \times \text{lengthsubstminGT3}(q, \text{MODS}(q))
\end{aligned}$

According to the equation presented in Table A, q represents a given query written according to one or more Chinese writing systems and MODS(q) represents a query selected from a candidate set of potential queries related to query q. Alternatively, query q may be referred to as query q1 and MODS(q) may be referred to as query q2 or q′. The initial number before each feature is a weight that may be used to emphasize or deemphasize features. The exemplary features utilized in the equation presented in Table A are described below.
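The weighted sum of Table A may be sketched as a constant term plus signed weights applied to feature values. The weights and signs below are transcribed from Table A; the dictionary layout, the function name, and the treatment of a missing feature as zero are assumptions made for illustration.

```python
INTERCEPT = 2.542

# Signed weights transcribed from Table A, keyed by feature name.
WEIGHTS = {
    "pq12min": -0.1778,
    "levroman": -0.3316,
    "agreechar": -1.064,
    "dlevpynchar": 1.098,
    "q1bidded": -0.2432,
    "wordr": 0.3486,
    "q2hasroman": 0.2487,
    "pq12max": -0.1284,
    "levtaiwanchar": -0.4667,
    "lengthdiffn": -0.2875,
    "entropy21min": -0.0006,
    "lengthsubstminGT3": -0.2875,
}


def lm1_score(features):
    """Linear score for one query pair.

    `features` maps feature names to numeric values; a feature that is
    absent is treated as 0.0 (an assumption of this sketch).
    """
    unknown = set(features) - set(WEIGHTS)
    if unknown:
        raise KeyError(f"unknown feature(s): {sorted(unknown)}")
    return INTERCEPT + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())


print(lm1_score({}))
print(lm1_score({"q1bidded": 1, "q2hasroman": 1}))
```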

Pq12min may be a function for calculating the query substitution probability of query q1 following query q2 in a log of user query sessions, such as from the search log database 112. The search log database 112 may identify the order of the one or more queries submitted by the user, for example, to provide an indication of how the user refined a query, how the user rewrote a query, how the user utilized one or more alternate writing systems of a language with multiple writing systems to express a query q, etc. When queries q1 and q2 follow one another in a search log database 112, it may be an indication that they are similar because q2 may be a refinement of q1. According to one embodiment, the pq12min function calculates a query substitution probability of a given query q1 following a given query q2, and may also be used to calculate a unit substitution of a unit u following a given unit u′. In one embodiment, pq12min=prob(U_i−>U_i′|U_i)/max_j prob(U_i−>U_j|U_i), where U_i is q1 or its units, U_i′ is possible U_i substitutions, and U_j is q2 or its units. For query suggestions, pq12min may be the normalized probability of q2 as q1's substitution. In one embodiment, a normalized probability is computed for each unit in q1 being substituted by the corresponding unit in q2, and the minimum of these probabilities is taken as pq12min.
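One possible reading of the normalized substitution probability is sketched below: the probability of a particular substitution is divided by the probability of the most frequent substitution for the same source unit, and the minimum over the unit pairs is taken. The nested dictionary of counts, the sample values, and the function names are hypothetical; the actual log format and normalization may differ.

```python
def normalized_substitution_prob(counts, source, target):
    """P(source -> target | source) divided by the largest such probability.

    counts[source][candidate] is the (hypothetical) number of times
    `candidate` was observed as a substitution for `source` in a query log.
    """
    row = counts.get(source, {})
    total = sum(row.values())
    if total == 0 or target not in row:
        return 0.0
    p_target = row[target] / total
    p_best = max(row.values()) / total
    return p_target / p_best


def pq12min(counts, unit_pairs):
    """Minimum normalized substitution probability over aligned unit pairs."""
    return min(normalized_substitution_prob(counts, u1, u2) for u1, u2 in unit_pairs)


# Hypothetical substitution counts, not taken from any real log.
counts = {
    "new york": {"manhattan": 30, "nyc": 60},
    "hotel": {"motel": 20, "hostel": 20},
}
print(pq12min(counts, [("new york", "manhattan"), ("hotel", "motel")]))
```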

Levroman is a comparison using the roman characters of a query, such as with conversion form 322, which removes Chinese characters. For each query all non-roman characters may be removed, but spaces are left in the query. The roman character parts are changed into arrays. Each roman character is an element in the array, including any spaces. The Levenshtein distance is measured between the two arrays. In the case that neither q1 nor q2 has roman characters, levroman is set to 0. In the case that one of q1 or q2 has roman characters but the other does not have roman characters, levroman is set to 1. As an example, consider a first query q1=

map” and a second query q2=

map.” The first query does not include a space before map, but the second query includes a space before map. After the Chinese characters are removed, the queries are converted into arrays, in which q1 is represented as the array:

and q2 is represented as the array:

The Levenshtein distance between the two arrays is one because of the space in the first element of q2. Accordingly because there are four elements, the Levenshtein distance may be represented as ¼=0.25 and levroman is 0.25 for this query pair.

Agreechar may relate to character agreement without removing a space regardless of the order of characters. Agreechar may be similar to wordr discussed below, except it is for the character level rather than the word level. In one embodiment, agreechar is the proportion of unique characters in common between a query pair, such as: $\text{agreechar} = \frac{|C_{q1} \cap C_{q2}|}{|C_{q1} \cup C_{q2}|},$ in which C_(q1) is the set of unique characters (including space) in q1, and C_(q2) is the set of unique characters (including space) in q2. In the levroman example, q1 and q2 have 7 unique characters in total, which are

“m”, “a”, “p” and a space. Query q1 and q2 share 5 unique characters, which are

“m”, “a” and “p”. Therefore, agreechar is 0.714 (calculated by 5/7) for this query pair.

Wordr is similar to agreechar except it matches words rather than characters. The queries are separated into words, segments, or units as described above. The proportion of unique words not in common is determined for wordr. In other words, wordr=1−proportion of unique words in common, such as $\text{wordr} = 1 - \frac{|w_{q1} \cap w_{q2}|}{|w_{q1} \cup w_{q2}|},$ in which w_(q1) is the set of unique words in q1, and w_(q2) is the set of unique words (including space) in q2. In the previous example of levroman,

map” is segmented into two words

and “map” and

map” is segmented into two words

and “map”. There are three unique words and one of them is common between q1 and q2, so wordr is 1−⅓=0.667.
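Wordr may be sketched in the same manner over sets of words. The sketch assumes the queries have already been segmented into words or units; the segmented word lists shown are hypothetical, and the value returned when both lists are empty is an assumption.

```python
def wordr(words1, words2):
    """One minus the proportion of unique words shared by two segmented queries."""
    w1, w2 = set(words1), set(words2)
    union = w1 | w2
    if not union:
        return 0.0  # assumption: two empty queries have no words that differ
    return 1.0 - len(w1 & w2) / len(union)


# Hypothetical segmentations: three unique words overall, one in common.
print(round(wordr(["台北市", "map"], ["台北", "map"]), 3))
```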

Dlevpynchar utilizes the complete pinyin without tone 312 conversion form. The first query q1 and second query q2 first have a common domain removed. Each roman character (including spaces) is kept, while each Chinese character is converted into pinyin without tone. The queries are then transformed into arrays. Each roman character is an element in the array and each Chinese character's pinyin without tone is an element in the array. The Levenshtein distance is then measured. In the example described above, when query q1

map” and query q2

map” where there is no space in query q1, but there is a space in query q2. The first query q1 is converted into an array:

The second query q2 is converted into an array:

The Levenshtein distance is computed between the two arrays to be ⅙=0.167, which may also be the dlevpynchar value for this query pair.

Q1bidded is 1 if q1 is bidded and q1bidded is 0 if q1 is not bidded. When q1 is a user query and q1 is bidded, it may mean that an advertiser chooses q1 as a keyword for the advertisements they want to show. This bidding process may also identify a cost they would like to pay if web searchers click the ads fetched by the keyword. When q1 is not bidded that may mean there are no matched keywords in the advertisement database. Therefore, a query identifying system may identify a related query (e.g. MODS(q)) to substitute for the user query.

Q2hasroman is 1 if q2 contains any roman characters, not including spaces. Q2hasroman is 0 if q2 does not contain any roman characters. The queries that are analyzed may be from a Chinese search engine or from a search engine that receives Chinese related queries. A Chinese search engine may receive queries with roman characters due to the usage of roman characters in Chinese and the popularity of roman character based languages such as English. The Chinese characters and roman characters may be processed differently. For example, a Chinese character may be converted into pinyin for a similarity comparison, while roman characters are not converted into pinyin. Accordingly, a similarity score computation may be adjusted based on the presence of roman characters.

Pq21max may be a function for calculating the query substitution probability of query q1 following query q2 in a log of user query sessions, such as from the search log database 112. In one embodiment, pq21max=prob(U_i−>U_i′|U_i′)/max_j prob(U_i−>U_j|U_j), where U_i is q1 or its units, U_i′ is possible U_i substitutions, and U_j is q2 or its units. The normalized probability may be calculated according to the above equation for each unit pair in the query pair, and the maximum is used as pq21max.

Levtaiwanchar utilizes the removal of roman characters 324 conversion. In particular, all non-Chinese characters are removed and the remaining Chinese character parts are put into an array where each Chinese character is an element in the array. The Levenshtein distance is measured between the two arrays. When neither query q1 nor query q2 includes Chinese characters, levtaiwanchar is 0. When only one of q1 or q2 has Chinese characters levtaiwanchar is 1. In the example described above, when query q1=

map” and query q2=

where there is no space in query q1, but there is a space in query q2. The first query q1 is converted into an array:

Query q2 becomes the array:

Accordingly, the Levenshtein distance is computed between the two arrays, which is ⅓=0.333 and levtaiwanchar is 0.333 for this query pair.

Lengthdiffn is the length difference in characters between q1 and q2, which is normalized by their maximum length in characters. In one embodiment, lengthdiffn is: $\text{lengthdiffn} = \frac{\operatorname{abs}\left( |q_{1}| - |q_{2}| \right)}{\max\left( |q_{1}|, |q_{2}| \right)},$ in which |q1| and |q2| are the lengths of q1 and q2 in characters.

Entropy21min is an uncertainty that may be associated with a similarity between q1 and q2. For a whole query substitution, $\text{entropy21min} = \sum_{i} \frac{\text{freq}(q_{1} \rightarrow q_{2_{i}})}{\text{freq}(q_{2_{i}})} \times \log\left( \frac{\text{freq}(q_{1} \rightarrow q_{2_{i}})}{\text{freq}(q_{2_{i}})} \right),$ where i indexes the possible q1 query substitutions with q2. For unit substitution, $\text{entropy21min} = \min_{j} \sum_{i} \frac{\text{freq}(q_{1j} \rightarrow q_{2j_{i}})}{\text{freq}(q_{2j_{i}})} \times \log\left( \frac{\text{freq}(q_{1j} \rightarrow q_{2j_{i}})}{\text{freq}(q_{2j_{i}})} \right),$ where j indexes the unit substitutions between q1 and q2, and i indexes the possible substitutions of unit q1j.
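The entropy21min sums may be sketched as follows, taking the formulas as written (without a leading minus sign, so the result is at most zero when each ratio is at most one). The natural logarithm, the dictionaries of frequencies, and the sample values are assumptions made for illustration.

```python
import math


def entropy_sum(pair_freq, target_freq):
    """Sum over candidates of ratio * log(ratio).

    pair_freq[c]   : hypothetical count of the source being substituted by c
    target_freq[c] : hypothetical overall count of candidate c
    """
    total = 0.0
    for cand, f in pair_freq.items():
        if f == 0:
            continue  # a zero ratio contributes nothing (limit of x*log x)
        ratio = f / target_freq[cand]
        total += ratio * math.log(ratio)
    return total


def entropy21min_units(per_unit):
    """Unit form: minimum of the per-unit sums over all unit substitutions."""
    return min(entropy_sum(pf, tf) for pf, tf in per_unit)


# Hypothetical frequencies, not taken from any real log.
whole = entropy_sum({"a": 5, "b": 10}, {"a": 10, "b": 10})
units = entropy21min_units([
    ({"a": 5, "b": 10}, {"a": 10, "b": 10}),
    ({"x": 10}, {"x": 10}),
])
print(round(whole, 4), round(units, 4))
```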

LengthsubstminGT3 utilizes a substitution of characters. For query suggestions, lengthsubstminGT3 is 1 if the minimum length of q1 and q2 is greater than 3 in characters. Otherwise, lengthsubstminGT3 is 0. For unit suggestions, lengthsubstminGT3 is 1 if the minimum length of any of the substitution units in characters is greater than 3. Otherwise, lengthsubstminGT3 is 0. Query suggestion may refer to a generation of related queries based on an original user query. The user query may be broken into units as described above. A related unit may be found for each unit and combined to form a related query. For example, when a user enters a query for “New York hotel,” it may be split into two units “New York” and “hotel.” “New York” may be rewritten to a related query “Manhattan” and “hotel” may be rewritten to “motel.” Accordingly, “Manhattan motel” may be a related candidate query for an original user query of “New York hotel.”

As described, the equation in Table A and the corresponding features that are used to calculate a similarity score in the calculator 208 are exemplary. Alternatively, a different equation, different weights and different features may be utilized to compute a similarity score. For example, the edit distance may be computed for each of the conversion forms 302 and averaged to become the similarity score. Alternatively, weights may be added to each converted form, or additional comparison features 402 may be used.

In one embodiment, the equation that is used to determine the similarity score, such as the equation in Table A, is analyzed by comparing with a human or editorial control set. The editorial control set may include a human review of the similarity scores for pairs of queries to determine an accuracy of the equation used for calculating the similarity score. In one embodiment, the human review may be used to optimize the equation that calculates the similarity score. Human editors may label query pairs with a relevance score. The relevance score may be used as a training label for the similarity score calculation, such as for the weights used in the equation in Table A. The editorial score may be a response variable and/or a dependent variable. The model may be fitted using linear regression.

FIG. 5 is an illustration for identifying related queries. In block 502, a user query is received. The user query may be Chinese-related and include at least one Chinese character. The user query may be received by a search engine 102. The user query may be compared with a selected candidate set of queries or search keywords in block 504. The candidate set may be selected from the search log database 112. In one embodiment, the candidate set may be chosen based on an initial comparison of similarity with the user query. The user query and/or the candidate set of queries may be converted into a different form or format for comparison, such as the conversion forms 302. The user query and a member of the candidate set are compared in block 508. In block 510, a similarity score is calculated to measure a similarity between the user query and the member of the candidate set. The similarity score may be based on utilizing any of the comparison features 402 for comparing a converted form of the user query with a converted form of the member. In block 512, another comparison at block 508 occurs for another member from the candidate set and continues until all members of the candidate set have been compared and have a similarity score. In block 514, the similarity scores for the members of the candidate set may be reviewed to identify the member of the candidate set with the closest similarity score to the user query. The identification of a similar member, such as a similar search keyword, may be used to identify which advertisements to display for sponsored searching.
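The flow of FIG. 5 may be sketched as a loop over the candidate set that converts each pair, scores it, and keeps the best-scoring member. The similarity used here (one minus a normalized edit distance over lower-cased character arrays) is a simplified stand-in and not the equation of Table A; the function names and the sample candidate set are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and replacements."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def convert(query):
    # Stand-in conversion form: lower-cased character array.
    return list(query.lower())


def similarity(q1, q2):
    # Stand-in score: 1 minus the edit distance normalized by the longer array.
    a, b = convert(q1), convert(q2)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def most_similar(user_query, candidates):
    """Score every candidate against the user query and return the best one."""
    scores = {c: similarity(user_query, c) for c in candidates}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical candidate set of search keywords.
print(most_similar("yaho mail", ["yahoo mail", "yahoo maps", "gmail"]))
```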

Referring to FIG. 6, an illustrative embodiment of a general computer system is shown and is designated 600. The user device 106, ad server 103, the search engine 102, the search log database 112, the data source 113, the unit dictionary 116, and/or the language analyzer 104 may be a computer or computing devices, such as the computer system 600 or any of its components. The computer system 600 can include a set of instructions that can be executed to cause the computer system 600 to perform any one or more of the methods or computer based functions disclosed herein. The computer system 600 may operate as a standalone device or may be connected, e.g., using a network, to other computer systems or peripheral devices.

In a networked deployment, the computer system may operate in the capacity of a server or as a client user computer in a server-client user network environment, or as a peer computer system in a peer-to-peer (or distributed) network environment. The computer system 600 can also be implemented as or incorporated into various devices, such as a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a mobile device, a palmtop computer, a laptop computer, a desktop computer, a communications device, a wireless telephone, a land-line telephone, a control system, a camera, a scanner, a facsimile machine, a printer, a pager, a personal trusted device, a web appliance, a network router, switch or bridge, or any other machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. In a particular embodiment, the computer system 600 can be implemented using electronic devices that provide voice, video or data communication. Further, while a single computer system 600 is illustrated, the term “system” shall also be taken to include any collection of systems or sub-systems that individually or jointly execute a set, or multiple sets, of instructions to perform one or more computer functions.

As illustrated in FIG. 6, the computer system 600 may include a processor 602, e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both. The processor 602 may be a component in a variety of systems. For example, the processor 602 may be part of a standard personal computer or a workstation. The processor 602 may be one or more general processors, digital signal processors, application specific integrated circuits, field programmable gate arrays, servers, networks, digital circuits, analog circuits, combinations thereof, or other now known or later developed devices for analyzing and processing data. The processor 602 may implement a software program, such as code generated manually (i.e., programmed).

The computer system 600 may include a memory 604 that can communicate via a bus 608. The memory 604 may be a main memory, a static memory, or a dynamic memory. The memory 604 may include, but is not limited to computer readable storage media such as various types of volatile and non-volatile storage media, including but not limited to random access memory, read-only memory, programmable read-only memory, electrically programmable read-only memory, electrically erasable read-only memory, flash memory, magnetic tape or disk, optical media and the like. In one embodiment, the memory 604 includes a cache or random access memory for the processor 602. In alternative embodiments, the memory 604 is separate from the processor 602, such as a cache memory of a processor, the system memory, or other memory. The memory 604 may be an external storage device or database for storing data. Examples include a hard drive, compact disc (“CD”), digital video disc (“DVD”), memory card, memory stick, floppy disc, universal serial bus (“USB”) memory device, or any other device operative to store data. The memory 604 is operable to store instructions executable by the processor 602. The functions, acts or tasks illustrated in the figures or described herein may be performed by the programmed processor 602 executing the instructions stored in the memory 604. The functions, acts or tasks are independent of the particular type of instruction set, storage media, processor or processing strategy and may be performed by software, hardware, integrated circuits, firmware, micro-code and the like, operating alone or in combination. Likewise, processing strategies may include multiprocessing, multitasking, parallel processing and the like.

As shown, the computer system 600 may further include a display unit 614, such as a liquid crystal display (LCD), an organic light emitting diode (OLED), a flat panel display, a solid state display, a cathode ray tube (CRT), a projector, a printer or other now known or later developed display device for outputting determined information. The display 614 may act as an interface for the user to see the functioning of the processor 602, or specifically as an interface with the software stored in the memory 604 or in the drive unit 606.

Additionally, the computer system 600 may include an input device 616 configured to allow a user to interact with any of the components of system 600. The input device 616 may be a number pad, a keyboard, or a cursor control device, such as a mouse, or a joystick, touch screen display, remote control or any other device operative to interact with the system 600.

In a particular embodiment, as depicted in FIG. 6, the computer system 600 may also include a disk or optical drive unit 606. The disk drive unit 606 may include a computer-readable medium 610 in which one or more sets of instructions 612, e.g. software, can be embedded. Further, the instructions 612 may embody one or more of the methods or logic as described herein. In a particular embodiment, the instructions 612 may reside completely, or at least partially, within the memory 604 and/or within the processor 602 during execution by the computer system 600. The memory 604 and the processor 602 also may include computer-readable media as discussed above.

The present disclosure contemplates a computer-readable medium that includes instructions 612 or receives and executes instructions 612 responsive to a propagated signal, so that a device connected to a network 620 can communicate voice, video, audio, images or any other data over the network 620. Further, the instructions 612 may be transmitted or received over the network 620 via a communication port 618. The communication port 618 may be a part of the processor 602 or may be a separate component. The communication port 618 may be created in software or may be a physical connection in hardware. The communication port 618 is configured to connect with a network 620, external media, the display 614, or any other components in system 600, or combinations thereof. The connection with the network 620 may be a physical connection, such as a wired Ethernet connection or may be established wirelessly as discussed below. Likewise, the additional connections with other components of the system 600 may be physical connections or may be established wirelessly.

The network 620 may include wired networks, wireless networks, or combinations thereof. The wireless network may be a cellular telephone network, an 802.11, 802.16, 802.20, or WiMax network. Further, the network 620 may be a public network, such as the Internet, a private network, such as an intranet, or combinations thereof, and may utilize a variety of networking protocols now available or later developed including, but not limited to TCP/IP based networking protocols.

While the computer-readable medium is shown to be a single medium, the term “computer-readable medium” includes a single medium or multiple media, such as a centralized or distributed database, and/or associated caches and servers that store one or more sets of instructions. The term “computer-readable medium” shall also include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor or that cause a computer system to perform any one or more of the methods or operations disclosed herein.

In a particular non-limiting, exemplary embodiment, the computer-readable medium can include a solid-state memory such as a memory card or other package that houses one or more non-volatile read-only memories. Further, the computer-readable medium can be a random access memory or other volatile re-writable memory. Additionally, the computer-readable medium can include a magneto-optical or optical medium, such as a disk or tapes or other storage device to capture carrier wave signals such as a signal communicated over a transmission medium. A digital file attachment to an e-mail or other self-contained information archive or set of archives may be considered a distribution medium that is a tangible storage medium. Accordingly, the disclosure is considered to include any one or more of a computer-readable medium or a distribution medium and other equivalents and successor media, in which data or instructions may be stored.

In an alternative embodiment, dedicated hardware implementations, such as application specific integrated circuits, programmable logic arrays and other hardware devices, can be constructed to implement one or more of the methods described herein. Applications that may include the apparatus and systems of various embodiments can broadly include a variety of electronic and computer systems. One or more embodiments described herein may implement functions using two or more specific interconnected hardware modules or devices with related control and data signals that can be communicated between and through the modules, or as portions of an application-specific integrated circuit. Accordingly, the present system encompasses software, firmware, and hardware implementations.

In accordance with various embodiments of the present disclosure, the methods described herein may be implemented by software programs executable by a computer system. Further, in an exemplary, non-limited embodiment, implementations can include distributed processing, component/object distributed processing, and parallel processing. Alternatively, virtual computer system processing can be constructed to implement one or more of the methods or functionality as described herein.

Although the present specification describes components and functions that may be implemented in particular embodiments with reference to particular standards and protocols, the invention is not limited to such standards and protocols. For example, standards for Internet and other packet switched network transmission (e.g., TCP/IP, UDP/IP, HTML, HTTP) represent examples of the state of the art. Such standards are periodically superseded by faster or more efficient equivalents having essentially the same functions. Accordingly, replacement standards and protocols having the same or similar functions as those disclosed herein are considered equivalents thereof.

The illustrations of the embodiments described herein are intended to provide a general understanding of the structure of the various embodiments. The illustrations are not intended to serve as a complete description of all of the elements and features of apparatus and systems that utilize the structures or methods described herein. Many other embodiments may be apparent to those of skill in the art upon reviewing the disclosure. Other embodiments may be utilized and derived from the disclosure, such that structural and logical substitutions and changes may be made without departing from the scope of the disclosure. Additionally, the illustrations are merely representational and may not be drawn to scale. Certain proportions within the illustrations may be exaggerated, while other proportions may be minimized. Accordingly, the disclosure and the figures are to be regarded as illustrative rather than restrictive.

One or more embodiments of the disclosure may be referred to herein, individually and/or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any particular invention or inventive concept. Moreover, although specific embodiments have been illustrated and described herein, it should be appreciated that any subsequent arrangement designed to achieve the same or similar purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all subsequent adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the description.

The Abstract of the Disclosure is provided to comply with 37 C.F.R. § 1.72(b) and is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, various features may be grouped together or described in a single embodiment for the purpose of streamlining the disclosure. This disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter may be directed to less than all of the features of any of the disclosed embodiments. Thus, the following claims are incorporated into the Detailed Description, with each claim standing on its own as defining separately claimed subject matter.

The above disclosed subject matter is to be considered illustrative, and not restrictive, and the appended claims are intended to cover all such modifications, enhancements, and other embodiments, which fall within the true spirit and scope of the present invention. Thus, to the maximum extent allowed by law, the scope of the present invention is to be determined by the broadest permissible interpretation of the following claims and their equivalents, and shall not be restricted or limited by the foregoing detailed description. While various embodiments of the invention have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the invention. Accordingly, the invention is not to be restricted except in light of the attached claims and their equivalents. 

1. A method for matching queries with keywords comprising: receiving a non-native language user query; gathering a candidate set of the keywords to be compared with the user query; converting the user query to a form for comparison with the keywords, wherein the keywords are converted to the form for comparison; comparing the converted user query with each of the keywords, wherein a similarity score is established for each keyword to determine similarity with the user query; and matching at least one keyword from the keywords with the user query based on the similarity score.
 2. The method according to claim 1 wherein the non-native language user query comprises a Chinese related user query, wherein the Chinese related user query comprises at least one Chinese character.
 3. The method according to claim 1 wherein each of the keywords is associated with at least one advertisement.
 4. The method according to claim 3 further comprising: providing the at least one advertisement that is associated with the matched at least one keyword.
 5. The method according to claim 1 wherein the converting of the user query comprises at least one of adding, removing, or substituting at least one character from the user query.
 6. The method according to claim 1 wherein the similarity score for each of the keywords is based on an edit distance with the converted user query.
 7. In a computer readable storage medium having stored therein data representing instructions executable by a programmed processor for comparing a Chinese query with keywords, the storage medium comprising instructions operative for: receiving the Chinese query; selecting a set of the keywords for comparing with the Chinese query; converting the Chinese query into at least one different form; converting the set of keywords into the at least one different form; determining at least one comparison between the Chinese query and the set of keywords, wherein the at least one comparison comprises a similarity score between the Chinese query and the set of keywords; and identifying one of the set of keywords based on the similarity score.
 8. The storage medium according to claim 7 wherein the at least one comparison comprises calculating an edit distance between the converted Chinese query and each of the converted set of keywords.
 9. The storage medium according to claim 8 wherein the identified one of the set of keywords has a closest edit distance with the converted Chinese query.
 10. The storage medium according to claim 7 wherein the at least one different form comprises a conversion of at least one character to at least one of a Chinese soundex form, a zhuyin form, a radicals form, a pinyin without tone form, a pinyin with tone form, or a Chinese utf8 form.
 11. A method for determining similarity between queries comprising: selecting at least two queries from a set of queries according to one language system; converting each of the at least two queries into a different format, wherein the conversion comprises a transformation of certain characters in the at least two queries; determining at least one comparison feature for each of the at least two queries; and comparing the at least two queries based on the at least one comparison feature to determine a similarity between the at least two queries based on each of the at least one comparison feature.
 12. The method according to claim 11 wherein the language system comprises Chinese, and the set of queries are Chinese related queries.
 13. The method according to claim 12 wherein the transformation of certain characters in the at least two queries comprises changing at least one character into at least one of a Chinese soundex form, a zhuyin form, a radicals form, a pinyin without tone form, a pinyin with tone form, or a Chinese utf8 form.
 14. The method according to claim 11 wherein the at least one comparison feature comprises at least one of comparing an edit distance, comparing a character level prefix overlap, or comparing a character level suffix overlap.
 15. The method according to claim 14 wherein the comparing the edit distance further comprises comparing an edit distance by characters or an edit distance by words.
 16. The method according to claim 11 wherein the at least two queries are converted into a plurality of different formats, wherein the comparing further comprises comparing the at least two queries in each of the plurality of different formats.
 17. A method for comparing queries comprising: receiving at least two queries, wherein each of the at least two queries comprises at least one Chinese representation; converting the at least two queries into at least one common format; calculating an edit distance between the converted at least two queries for each of the at least one common format; and recording the edit distances between each of the converted at least two queries.
 18. The method according to claim 17 wherein the at least one common format comprises an addition, subtraction, or substitution of at least one character.
 19. The method according to claim 17 wherein the at least one common format comprises a conversion of at least one character into at least one of a Chinese soundex form, a zhuyin form, a radicals form, a pinyin without tone form, a pinyin with tone form, or a Chinese utf8 form.
 20. A system for measuring related queries comprising: a search engine operative to receive a user search query; an ad server coupled with the search engine and operative to provide an advertisement for display in response to the received user search query, wherein the ad server includes a plurality of search keywords, each of which is associated with at least one advertisement; a search log database coupled with the search engine and operative to store search queries including the plurality of search keywords; and a language analyzer coupled with the search engine that comprises: a receiver operative to receive the user search query; a converter coupled with the receiver and operative to convert the user search query into a different form; a comparator coupled with the converter and operative to compare the converted search query with a candidate set of the plurality of search keywords; and a calculator coupled with the comparator and operative to calculate a similarity score for each member of the candidate set based on the comparison with the converted search query; wherein the associated at least one advertisement that is associated with the member of the candidate set with a closest similarity score is provided for display in response to the received search query.
 21. The system according to claim 20 wherein the user search query is Chinese related and the converter is operative to change the Chinese related user search query into the different form by at least one of adding, deleting or substituting at least one of the characters of the Chinese related user search query.
 22. The system according to claim 20 wherein the calculation of the similarity score comprises a computation of an edit distance between the converted user search query and the member of the candidate set.
 23. The system according to claim 20 wherein the converter is operative to convert the candidate set of the plurality of search keywords into the different form for comparison with the converted user search query. 
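
By way of a non-limiting illustration only, the conversion and comparison steps recited above (converting a query and the keywords to a common form, computing an edit distance, and optionally computing character level prefix and suffix overlaps) might be sketched as follows. The sketch is not the disclosed implementation. The table TOY_PINYIN and the function names to_pinyin_no_tone, edit_distance, prefix_overlap, suffix_overlap, and closest_keyword are hypothetical and introduced solely for this example; the table covers six characters, and the edit distance shown is an ordinary character level Levenshtein distance chosen as one possible measure.

```python
from typing import Dict, List, Tuple

# Hypothetical, deliberately tiny character-to-pinyin table (tones omitted).
# A practical converter would need a full dictionary; this is demo data only.
TOY_PINYIN: Dict[str, str] = {
    "北": "bei",
    "京": "jing",
    "背": "bei",
    "景": "jing",
    "上": "shang",
    "海": "hai",
}


def to_pinyin_no_tone(text: str) -> str:
    """Map each known character to toneless pinyin; pass other characters through."""
    return "".join(TOY_PINYIN.get(ch, ch) for ch in text)


def edit_distance(a: str, b: str) -> int:
    """Character level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def prefix_overlap(a: str, b: str) -> int:
    """Number of leading characters shared by a and b."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def suffix_overlap(a: str, b: str) -> int:
    """Number of trailing characters shared by a and b."""
    return prefix_overlap(a[::-1], b[::-1])


def closest_keyword(query: str, keywords: List[str]) -> Tuple[str, int]:
    """Return the keyword whose converted form is nearest the converted query."""
    q = to_pinyin_no_tone(query)
    scored = [(k, edit_distance(q, to_pinyin_no_tone(k))) for k in keywords]
    # On ties, min() keeps the first candidate in list order.
    return min(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # The query and the second keyword are homophones in toneless pinyin.
    print(closest_keyword("背景", ["上海", "北京"]))
```

In this sketch a smaller edit distance is treated as a closer match, so a query that differs from a keyword in its written characters but shares the same toneless pinyin yields a distance of zero. Other forms, such as zhuyin or radicals, and other ways of combining the features into a similarity score could be substituted without changing the overall structure.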